Efficient computation of complex multiplication results and very 
efficient fast Fourier transforms (FFTs) are provided. A parallel array 
VLIW digital signal processor (100) is employed along with specialized 
complex multiplication instructions and communication operations 
between the processing elements (101, 151, 153, 155) which are overlapped 
with computation to provide very high performance operation. 
Successive iterations of a loop of tightly packed VLIWs (100) are used 
Efficient Complex Multiplication and 
Fast Fourier Transform (FFT) 
Implementation on the ManArray Architecture 

This application claims the benefit of U.S. Provisional Application Serial No. 
60/103,712 filed October 9, 1998, which is incorporated by reference in its entirety herein. 
Field of the Invention 

The present invention relates generally to improvements to parallel processing, and 
more particularly to methods and apparatus for efficiently calculating the result of a complex 
multiplication. Further, the present invention relates to the use of this approach in a very 
efficient FFT implementation on the manifold array ("ManArray") processing architecture. 
Background of the Invention 

The product of two complex numbers x and y is defined to be 
$z = x_R y_R - x_I y_I + i(x_R y_I + x_I y_R)$, where $x = x_R + i x_I$, $y = y_R + i y_I$, 
and i is the imaginary unit, or the square root of negative one, with $i^2 = -1$. This complex 
multiplication of x and y is calculated in a variety of contexts, and it has been recognized 
that it will be highly advantageous to perform this calculation faster and more efficiently. 
Summary of the Invention 

The present invention defines hardware instructions to calculate the product of two 
complex numbers encoded as a pair of two fixed-point numbers of 16 bits each in two cycles 
with single cycle pipeline throughput efficiency. The present invention also defines 
extending a series of multiply complex instructions with an accumulate operation. These 
special instructions are then used to calculate the FFT of a vector of numbers efficiently. 

A more complete understanding of the present invention, as well as other features and 
advantages of the invention, will be apparent from the following Detailed Description and the 
accompanying drawings. 
Brief Description of the Drawings 

Fig. 1 illustrates an exemplary 2x2 ManArray iVLIW processor; 

Fig. 2A illustrates a presently preferred multiply complex instruction, MPYCX; 

Fig. 2B illustrates the syntax and operation of the MPYCX instruction of Fig. 2A; 

Fig. 3A illustrates a presently preferred multiply complex divide by 2 instruction, 
MPYCXD2; 

Fig. 3B illustrates the syntax and operation of the MPYCXD2 instruction of Fig. 3A; 

Fig. 4A illustrates a presently preferred multiply complex conjugate instruction, 
MPYCXJ; 

Fig. 4B illustrates the syntax and operation of the MPYCXJ instruction of Fig. 4A; 

Fig. 5A illustrates a presently preferred multiply complex conjugate divide by two 
instruction, MPYCXJD2; 

Fig. 5B illustrates the syntax and operation of the MPYCXJD2 instruction of Fig. 5A; 

Fig. 6 illustrates hardware aspects of a pipelined multiply complex and its divide by 
two instruction variant; 

Fig. 7 illustrates hardware aspects of a pipelined multiply complex conjugate, and its 
divide by two instruction variant; 

Fig. 8 shows an FFT signal flow graph; 

Figs. 9A-9H illustrate aspects of the implementation of a distributed FFT algorithm on 
a 2x2 ManArray processor using a VLIW algorithm with MPYCX instructions in a cycle-by- 
cycle sequence with each step corresponding to operations in the FFT signal flow graph; 

Fig. 9I illustrates how multiple iterations may be tightly packed in accordance with 
the present invention for a distributed FFT of length four; 

Fig. 9J illustrates how multiple iterations may be tightly packed in accordance with 
the present invention for a distributed FFT of length two; 

Figs. 10A and 10B illustrate Kronecker Product examples for use in reference to the 
mathematical presentation of the presently preferred distributed FFT algorithm; 

Fig. 11A illustrates a presently preferred multiply accumulate instruction, MPYA; 

Fig. 11B illustrates the syntax and operation of the MPYA instruction of Fig. 11A; 

Fig. 12A illustrates a presently preferred sum of 2 products accumulate instruction, 
SUM2PA; 

Fig. 12B illustrates the syntax and operation of the SUM2PA instruction of Fig. 12A; 

Fig. 13A illustrates a presently preferred multiply complex accumulate instruction, 
MPYCXA; 

Fig. 13B illustrates the syntax and operation of the MPYCXA instruction of Fig. 13A; 

Fig. 14A illustrates a presently preferred multiply complex accumulate divide by two 
instruction, MPYCXAD2; 

Fig. 14B illustrates the syntax and operation of the MPYCXAD2 instruction of Fig. 
14A; 

Fig. 15A illustrates a presently preferred multiply complex conjugate accumulate 
instruction, MPYCXJA; 

Fig. 15B illustrates the syntax and operation of the MPYCXJA instruction of Fig. 
15A; 

Fig. 16A illustrates a presently preferred multiply complex conjugate accumulate 
divide by two instruction, MPYCXJAD2; 

Fig. 16B illustrates the syntax and operation of the MPYCXJAD2 instruction of Fig. 
16A; 

Fig. 17 illustrates hardware aspects of a pipelined multiply complex accumulate and 
its divide by two variant; and 

Fig. 18 illustrates hardware aspects of a pipelined multiply complex conjugate 
accumulate and its divide by two variant. 
Detailed Description 

Further details of a presently preferred ManArray architecture for use in conjunction 
with the present invention are found in U.S. Patent Application Serial No. 08/885,310 filed 
June 30, 1997, U.S. Patent Application Serial No. 08/949,122 filed October 10, 1997, U.S. 
Patent Application Serial No. 09/169,255 filed October 9, 1998, U.S. Patent Application 
Serial No. 09/169,256 filed October 9, 1998, U.S. Patent Application Serial No. 09/169,072 
filed October 9, 1998, U.S. Patent Application Serial No. 09/187,539 filed November 6, 
1998, U.S. Patent Application Serial No. 09/205,558 filed December 4, 1998, U.S. Patent 
Application Serial No. 09/215,081 filed December 18, 1998, U.S. Patent Application Serial 
No. 09/228,374 filed January 12, 1999, U.S. Patent Application Serial No. 09/238,446 filed 
January 28, 1999, U.S. Patent Application Serial No. 09/267,570 filed March 12, 1999, as 
well as Provisional Application Serial No. 60/092,130 entitled "Methods and Apparatus for 
Instruction Addressing in Indirect VLIW Processors" filed July 9, 1998, Provisional 
Application Serial No. 60/103,712 entitled "Efficient Complex Multiplication and Fast 
Fourier Transform (FFT) Implementation on the ManArray" filed October 9, 1998, 
Provisional Application Serial No. 60/106,867 entitled "Methods and Apparatus for 
Improved Motion Estimation for Video Encoding" filed November 3, 1998, Provisional 
Application Serial No. 60/113,637 entitled "Methods and Apparatus for Providing Direct 
Memory Access (DMA) Engine" filed December 23, 1998 and Provisional Application Serial 
No. 60/113,555 entitled "Methods and Apparatus Providing Transfer Control" filed 
December 23, 1998, respectively, and incorporated by reference herein in their entirety. 
In a presently preferred embodiment of the present invention, a ManArray 2x2 iVLIW 
single instruction multiple data stream (SIMD) processor 100 shown in Fig. 1 contains a 
controller sequence processor (SP) combined with processing element-0 (PE0) SP/PE0 101, 
as described in further detail in U.S. Application Serial No. 09/169,072 entitled "Methods 
and Apparatus for Dynamically Merging an Array Controller with an Array Processing 
Element". Three additional PEs 151, 153, and 155 are also utilized to demonstrate the 
implementation of efficient complex multiplication and fast Fourier transform (FFT) 
computations on the ManArray architecture in accordance with the present invention. It is 
noted that the PEs can also be labeled with their matrix positions as shown in parentheses for 
PE0 (PE00) 101, PE1 (PE01) 151, PE2 (PE10) 153, and PE3 (PE11) 155. 

The SP/PE0 101 contains a fetch controller 103 to allow the fetching of short 
instruction words (SIWs) from a 32-bit instruction memory 105. The fetch controller 103 
provides the typical functions needed in a programmable processor such as a program counter 
(PC), branch capability, digital signal processing, EP loop operations, support for interrupts, 
and also provides the instruction memory management control which could include an 
instruction cache if needed by an application. In addition, the SIW I-Fetch controller 103 
dispatches 32-bit SIWs to the other PEs in the system by means of a 32-bit instruction bus 
102. 

In this exemplary system, common elements are used throughout to simplify the 
explanation, though actual implementations are not so limited. For example, the execution 
units 131 in the combined SP/PE0 101 can be separated into a set of execution units 
optimized for the control function, e.g. fixed point execution units, and the PE0 as well as the 
other PEs 151, 153 and 155 can be optimized for a floating point application. For the 
purposes of this description, it is assumed that the execution units 131 are of the same type in 
the SP/PE0 and the other PEs. In a similar manner, SP/PE0 and the other PEs use a five 
instruction slot iVLIW architecture which contains a very long instruction word memory 
(VIM) 109 and an instruction decode and VIM controller function unit 107 which 
receives instructions as dispatched from the SP/PE0's I-Fetch unit 103 and generates the VIM 
addresses-and-control signals 108 required to access the iVLIWs stored in the VIM. These 
iVLIWs are identified by the letters SLAMD in VIM 109. The loading of the iVLIWs is 
described in further detail in U.S. Patent Application Serial No. 09/187,539 entitled "Methods 
and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE 
Communication". Also contained in the SP/PE0 and the other PEs is a common PE 
configurable register file 127 which is described in further detail in U.S. Patent Application 
Serial No. 09/169,255 entitled "Methods and Apparatus for Dynamic Instruction Controlled 
Reconfiguration Register File with Extended Precision". 

Due to the combined nature of the SP/PE0, the data memory interface controller 125 
must handle the data processing needs of both the SP controller, with SP data in memory 121, 
and PE0, with PE0 data in memory 123. The SP/PE0 controller 125 also is the source of the 
data that is sent over the 32-bit broadcast data bus 126. The other PEs 151, 153, and 155 
contain common physical data memory units 123', 123", and 123"' though the data stored in 
them is generally different as required by the local processing done on each PE. The 
interface to these PE data memories is also a common design in PEs 1, 2, and 3 and indicated 
by PE local memory and data bus interface logic 157, 157' and 157". Interconnecting the PEs 
for data transfer communications is the cluster switch 171 more completely described in U.S. 
Patent Application Serial No. 08/885,310 entitled "Manifold Array Processor", U.S. 
Application Serial No. 08/949,122 entitled "Methods and Apparatus for Manifold Array 
Processing", and U.S. Application Serial No. 09/169,256 entitled "Methods and Apparatus 
for ManArray PE-to-PE Switch Control". The interface to a host processor, other peripheral 
devices, and/or external memory can be done in many ways. The primary mechanism shown 
for completeness is contained in a direct memory access (DMA) control unit 181 that 
provides a scalable ManArray data bus 183 that connects to devices and interface units 
external to the ManArray core. The DMA control unit 181 provides the data flow and bus 
arbitration mechanisms needed for these external devices to interface to the ManArray core 
memories via the multiplexed bus interface represented by line 185. A high level view of a 
ManArray Control Bus (MCB) 191 is also shown. 

All of the above noted patents are assigned to the assignee of the present invention 
and incorporated herein by reference in their entirety. 
Special Instructions for Complex Multiply 

Turning now to specific details of the ManArray processor as adapted by the present 
invention, the present invention defines the following special hardware instructions that 
execute in each multiply accumulate unit (MAU), one of the execution units 131 of Fig. 1 and 
in each PE, to handle the multiplication of complex numbers: 

• MPYCX instruction 200 (Fig. 2A), for multiplication of complex numbers, where the 
complex product of two source operands is rounded according to the rounding mode 
specified in the instruction and loaded into the target register. The complex numbers 
are organized in the source register such that halfword H1 contains the real 
component and halfword H0 contains the imaginary component. The MPYCX 
instruction format is shown in Fig. 2A. The syntax and operation description 210 is 
shown in Fig. 2B. 

• MPYCXD2 instruction 300 (Fig. 3A), for multiplication of complex numbers, with 
the results divided by 2, where the complex product of two source operands is 
divided by two, rounded according to the rounding mode specified in the instruction, 
and loaded into the target register. The complex numbers are organized in the source 
register such that halfword H1 contains the real component and halfword H0 contains 
the imaginary component. The MPYCXD2 instruction format is shown in Fig. 3A. 
The syntax and operation description 310 is shown in Fig. 3B. 

• MPYCXJ instruction 400 (Fig. 4A), for multiplication of complex numbers where the 
second argument is conjugated, where the complex product of the first source operand 
times the conjugate of the second source operand is rounded according to the 
rounding mode specified in the instruction and loaded into the target register. The 
complex numbers are organized in the source register such that halfword H1 contains 
the real component and halfword H0 contains the imaginary component. The 
MPYCXJ instruction format is shown in Fig. 4A. The syntax and operation 
description 410 is shown in Fig. 4B. 

• MPYCXJD2 instruction 500 (Fig. 5A), for multiplication of complex numbers where 
the second argument is conjugated, with the results divided by 2, where the complex 
product of the first source operand times the conjugate of the second operand is 
divided by two, rounded according to the rounding mode specified in the instruction 
and loaded into the target register. The complex numbers are organized in the source 
register such that halfword H1 contains the real component and halfword H0 contains 
the imaginary component. The MPYCXJD2 instruction format is shown in Fig. 5A. 
The syntax and operation description 510 is shown in Fig. 5B. 

All of the above instructions 200, 300, 400 and 500 complete in 2 cycles and are 
pipeline-able. That is, another operation can start executing on the execution unit after the 
first cycle. All complex multiplication instructions return a word containing the real and 
imaginary part of the complex product in halfwords H1 and H0 respectively. 
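The behavior of the four instructions can be illustrated with a small software model. The following Python sketch is not the operation defined in Figs. 2B-5B, which are not reproduced in this text. It assumes signed Q15 halfwords (15 fractional bits), round-to-nearest, and wrap-around on overflow; the function names `pack`, `unpack` and `mpycx` and the `frac_bits` parameter are hypothetical.

```python
def to_signed16(h):
    """Interpret the low 16 bits of h as a signed two's-complement value."""
    h &= 0xFFFF
    return h - 0x10000 if h & 0x8000 else h

def unpack(word):
    """Split a 32-bit word into (H1, H0) = (real, imaginary) halfwords."""
    return to_signed16(word >> 16), to_signed16(word)

def pack(re, im):
    """Place the real part in H1 and the imaginary part in H0."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def mpycx(rx, ry, conjugate=False, div2=False, frac_bits=15):
    """Model of MPYCX / MPYCXJ / MPYCXD2 / MPYCXJD2 (assumed Q15, ROUND mode)."""
    xr, xi = unpack(rx)
    yr, yi = unpack(ry)
    if conjugate:                      # MPYCXJ variants conjugate the second operand
        yi = -yi
    re = xr * yr - xi * yi             # full-precision real part
    im = xr * yi + xi * yr             # full-precision imaginary part
    shift = frac_bits + (1 if div2 else 0)   # the D2 variants shift one extra bit
    half = 1 << (shift - 1)            # assumed round-to-nearest (ties toward +infinity)
    return pack((re + half) >> shift, (im + half) >> shift)

# (0.5 + 0.5i) * (0.5 - 0.25i) = 0.375 + 0.125i in Q15
x = pack(0x4000, 0x4000)
y = pack(0x4000, -0x2000)
print(hex(mpycx(x, y)))
```

The single `shift` value mirrors the description of the divide by two variants, which select a different set of result bits rather than performing a separate division.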
To preserve maximum accuracy, and provide flexibility to programmers, four possible 
rounding modes are defined: 

• Round toward the nearest integer (referred to as ROUND) 

• Round toward 0 (truncate or fix, referred to as TRUNC) 

• Round toward infinity (round up or ceiling, the smallest integer greater than or 
equal to the argument, referred to as CEIL) 

• Round toward negative infinity (round down or floor, the largest integer smaller 
than or equal to the argument, referred to as FLOOR). 
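As an illustration of how the four modes differ, the following Python sketch divides an integer by a power of two under each mode. It is an assumption-laden sketch, not the hardware definition: the function name `shift_round` is hypothetical, and ties in ROUND mode are assumed to go toward positive infinity, which the text does not specify.

```python
def shift_round(value, shift, mode):
    """Return value / 2**shift rounded according to mode (shift >= 1)."""
    floor_q = value >> shift                 # Python's >> rounds toward -infinity
    remainder = value - (floor_q << shift)   # always in [0, 2**shift)
    if mode == "FLOOR":
        return floor_q
    if mode == "CEIL":
        return floor_q + (1 if remainder else 0)
    if mode == "TRUNC":                      # toward zero: bump negative inexact results
        return floor_q + (1 if remainder and value < 0 else 0)
    if mode == "ROUND":                      # nearest, ties toward +infinity (assumed)
        return floor_q + (1 if remainder >= (1 << (shift - 1)) else 0)
    raise ValueError(f"unknown rounding mode: {mode}")

for mode in ("ROUND", "TRUNC", "CEIL", "FLOOR"):
    # -7 / 4 = -1.75 and 7 / 4 = 1.75
    print(mode, shift_round(-7, 2, mode), shift_round(7, 2, mode))
```

The modes only disagree when the discarded low-order bits are nonzero, which is why the choice matters for accumulated error in long fixed-point FFTs.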

Hardware suitable for implementing the multiply complex instructions is 
shown in Fig. 6 and Fig. 7. These figures illustrate a high level view of the hardware 
apparatus 600 and 700 appropriate for implementing the functions of these instructions. This 
hardware capability may be advantageously embedded in the ManArray multiply accumulate 
unit (MAU), one of the execution units 131 of Fig. 1 and in each PE, along with other 
hardware capability supporting other MAU instructions. As a pipelined operation, the first 
execute cycle begins with a read of the source register operands from the compute register 
file (CRF) shown as registers 603 and 605 in Fig. 6 and as registers 111, 127, 127', 127", and 
127"' in Fig. 1. These register values are input to the MAU logic after some operand access 
delay in halfword data paths as indicated to the appropriate multiplication units 607, 609, 
611, and 613 of Fig. 6. The outputs of the multiplication operation units, X_R*Y_R 607, 
X_R*Y_I 609, X_I*Y_R 611, and X_I*Y_I 613, are stored in pipeline registers 615, 617, 619, 
and 621, respectively. The second execute cycle, which can occur while a new multiply complex 
instruction is using the first cycle execute facilities, begins with using the stored pipeline 
register values, in pipeline registers 615, 617, 619, and 621, and appropriately adding in adder 
625 and subtracting in subtracter 623 as shown in Fig. 6. The add function and subtract 
function are selectively controlled functions allowing either addition or subtraction operations 
as specified by the instruction. The values generated by the apparatus 600 shown in Fig. 6 
contain a maximum precision of calculation which exceeds 16 bits. Consequently, the 
appropriate bits must be selected and rounded as indicated in the instruction before storing 
the final results. The selection of the bits and rounding occurs in selection and rounder 
circuit 627. The two 16-bit rounded results are then stored in the appropriate halfword 
position of the target register 629 which is located in the compute register file (CRF). The 
divide by two variant of the multiply complex instruction 300 selects a different set of bits as 
specified in the instruction through block 627. The hardware 627 shifts each data value right 
by an additional 1 bit and loads two divided-by-2 rounded and shifted values into each 
halfword position in the target registers 629 in the CRF. 

The hardware 700 for the multiply complex conjugate instruction 400 is shown in Fig. 
7. The main difference between multiply complex and multiply complex conjugate is in 
adder 723 and subtracter 725 which swap the addition and subtraction operation as compared 
with Fig. 6. The results from adder 723 and subtracter 725 still need to be selected and 
rounded in selection and rounder circuit 727 and the final rounded results stored in the target 
register 729 in the CRF. The divide by two variant of the multiply complex conjugate 
instruction 500 selects a different set of bits as specified in the instruction through selection 
and rounder circuit 727. The hardware of circuit 727 shifts each data value right by an 
additional 1 bit and loads two divided-by-2 rounded and shifted values into each halfword 
position in the target registers 729 in the CRF. 
The FFT Algorithm 

The power of indirect VLIW parallelism using the complex multiplication instructions 
is demonstrated with the following fast Fourier transform (FFT) example. The algorithm of 
this example is based upon the sparse factorization of a discrete Fourier transform (DFT) 
matrix. Kronecker-product mathematics is used to demonstrate how a scalable algorithm is 
created. 

The Kronecker product provides a means to express parallelism using mathematical 
notation. It is known that there is a direct mapping between different tensor product forms 
and some important architectural features of processors. For example, tensor matrices can be 
created in parallel form and in vector form. J. Granata, M. Conner, R. Tolimieri, The Tensor 
Product: A Mathematical Programming Language for FFTs and other Fast DSP Operations, 
IEEE SP Magazine, January 1992, pp. 40-48. The Kronecker product of two matrices is a 
block matrix with blocks that are copies of the second argument multiplied by the 
corresponding element of the first argument. Details of an exemplary calculation of matrix 
vector products 

$$y = (I_m \otimes A)\,x$$

are shown in Fig. 10A. The matrix is block diagonal with m copies of A. If vector x was 
distributed block-wise in m processors, the operation can be done in parallel without any 
communication between the processors. On the other hand, the following calculation, shown 
in detail in Fig. 10B, 

$$y = (A \otimes I_m)\,x$$

requires that x be distributed physically on m processors for vector parallel computation. 
The two Kronecker products are related via the identity 

$$I \otimes A = P\,(A \otimes I)\,P^T$$

where P is a special permutation matrix called stride permutation and $P^T$ is the transpose 
permutation matrix. The stride permutation defines the required data distribution for a 
parallel operation, or the communication pattern needed to transform block distribution to 
cyclic and vice versa. 
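The identity relating the two Kronecker products can be checked numerically on a small case. The following pure-Python sketch is an illustration only; the helper names (`kron`, `identity`, `matmul`, `transpose`, `stride_permutation`) and the index convention chosen for P are assumptions, not taken from Figs. 10A and 10B.

```python
def kron(A, B):
    """Kronecker product: blocks are copies of B scaled by the elements of A."""
    rb, cb = len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(len(A[0]) * cb)]
            for i in range(len(A) * rb)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def stride_permutation(m, k):
    """Permutation taking cyclic index i*m + a to block index a*k + i (assumed convention)."""
    n = m * k
    P = [[0] * n for _ in range(n)]
    for a in range(m):
        for i in range(k):
            P[a * k + i][i * m + a] = 1
    return P

A = [[1, 2], [3, 4]]
m = 3
P = stride_permutation(m, len(A))
block_form = kron(identity(m), A)                 # I_m (x) A: block diagonal
vector_form = kron(A, identity(m))                # A (x) I_m: interleaved
assert block_form == matmul(matmul(P, vector_form), transpose(P))
```

Applying P to a vector regroups elements that are m apart into contiguous blocks, which is the block-to-cyclic redistribution the text describes.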

The mathematical description of parallelism and data distributions makes it possible 
to conceptualize parallel programs, and to manipulate them using linear algebra identities and 
thus better map them onto target parallel architectures. In addition, Kronecker product 
notation arises in many different areas of science and engineering. The Kronecker product 
simplifies the expression of many fast algorithms. For example, different FFT algorithms 
correspond to different sparse matrix factorizations of the Discrete Fourier Transform (DFT), 
whose factors involve Kronecker products. Charles F. Van Loan, Computational Frameworks 
for the Fast Fourier Transform, SIAM, 1992, pp. 78-80. 

The following equation shows a Kronecker product expression of the FFT algorithm, 
based on the Kronecker product factorization of the DFT matrix, 

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

where: 

n is the length of the transform 
p is the number of PEs 
m = n/p 

The equation is operated on from right to left with the $P_{n,p}$ permutation operation 
occurring first. The permutation directly maps to a direct memory access (DMA) operation 
that specifies how the data is to be loaded in the PEs based upon the number of PEs p and 
length of the transform n. 

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

where $P_{n,p}$ corresponds to DMA loading data with stride p to local PE memories. 

In the next stage of operation all the PEs execute a local FFT of length m=n/p with 
local data. No communication between PEs is required. 

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

where $(I_p \otimes F_m)$ specifies that all PEs execute a local FFT of length m sequentially, with 
local data. 

In the next stage, all the PEs scale their local data by the twiddle factors and 
collectively execute m distributed FFTs of length p. This stage requires inter-PE 
communications. 

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

where $(F_p \otimes I_m)\, D_{p,m}$ specifies that all PEs scale their local data by the twiddle factors and 
collectively execute multiple FFTs of length p on distributed data. In this final stage of the 
FFT computation, a relatively large number m of small distributed FFTs of size p must be 
calculated efficiently. The challenge is to completely overlap the necessary communications 
with the relatively simple computational requirements of the FFT. 
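The three stages of the factorization can be traced in software. The Python sketch below is a numerical illustration under stated assumptions, not the ManArray implementation: it assumes the forward transform uses the kernel exp(-2*pi*i/n), takes the twiddle matrix to hold w_n^(q*k) for PE q and local index k, and uses a direct O(m^2) DFT for each small transform. The names `dft` and `distributed_fft` are hypothetical.

```python
import cmath

def dft(v):
    """Direct DFT used here for every small transform (simple, not fast)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def distributed_fft(x, p):
    """Evaluate F_n x as (F_p (x) I_m) D (I_p (x) F_m) P x, read right to left."""
    n = len(x)
    assert n % p == 0, "transform length must be a multiple of the PE count"
    m = n // p
    # P_{n,p}: stride-p load, PE q receives x[q], x[q+p], x[q+2p], ...
    local = [x[q::p] for q in range(p)]
    # (I_p (x) F_m): each PE transforms its own m points, no communication
    local = [dft(block) for block in local]
    # D_{p,m}: each PE scales its local data by the twiddle factors
    w = cmath.exp(-2j * cmath.pi / n)
    local = [[w ** (q * k) * local[q][k] for k in range(m)] for q in range(p)]
    # (F_p (x) I_m): m small FFTs of length p, each spanning all p PEs
    y = [0j] * n
    for k in range(m):
        column = dft([local[q][k] for q in range(p)])
        for r in range(p):
            y[r * m + k] = column[r]
    return y

x = [complex(k, (3 * k) % 5) for k in range(8)]
reference = dft(x)
result = distributed_fft(x, p=4)
assert all(abs(a - b) < 1e-9 for a, b in zip(result, reference))
```

Only the last loop mixes data held by different PEs, which is the stage whose communication has to be overlapped with computation.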

The sequence of illustrations of Figs. 9A-9H outlines the ManArray distributed FFT 
algorithm using the indirect VLIW architecture, the multiply complex instructions, and 
operating on the 2x2 ManArray processor 100 of Fig. 1. The signal flow graph for the small 
FFT is shown in Fig. 8 and also shown in the right-hand-side of Figs. 9A-9H. In Fig. 8, the 
operation for a 4 point FFT is shown where each PE executes the operations shown on a 
horizontal row. The operations occur in parallel on each vertical time slice of operations as 
shown in the signal flow graph figures in Figs. 9A-9H. The VLIW code is displayed in a 
tabular form in Figs. 9A-9H that corresponds to the structure of the ManArray architecture 
and the iVLIW instruction. The columns of the table correspond to the execution units 
available in the ManArray PE: Load Unit, Arithmetic Logic Unit (ALU), Multiply 
Accumulate Unit (MAU), Data Select Unit (DSU) and the Store Unit. The rows of the table 
can be interpreted as time steps representing the execution of different iVLIW lines. 

The technique shown is a software pipeline implemented approach with iVLIWs. In 
Figs. 9A-9I, the tables show the basic pipeline for PE3 155. Fig. 9A represents the input of 
the data X and its corresponding twiddle factor W by loading them from the PE's local 
memories, using the load indirect (lii) instruction. Fig. 9B illustrates the complex arguments 
X and W which are multiplied using the MPYCX instruction 200, and Fig. 9C illustrates the 
communications operation between PEs, using a processing element exchange (PEXCHG) 
instruction. Further details of this instruction are found in U.S. Application Serial No. 
09/169,256 entitled "Methods and Apparatus for ManArray PE-PE Switch Control" filed 
October 9, 1998. Fig. 9D illustrates that the local and received quantities are added or subtracted 
(depending upon the processing element, where for PE3 a subtract (sub) instruction is used). 
Fig. 9E illustrates the result being multiplied by -i on PE3, using the MPYCX instruction. 
Fig. 9F illustrates another PE-to-PE communications operation where the previous product is 
exchanged between the PEs, using the PEXCHG instruction. Fig. 9G illustrates that the local and 
received quantities are added or subtracted (depending upon the processing element, where 
for PE3 a subtract (sub) instruction is used). Fig. 9H illustrates the step where the results are 
stored to local memory, using a store indirect (sii) instruction. 
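The sequence of steps can be mimicked in software for one 4-point FFT spread over four PEs. The Python sketch below is a hedged illustration: the figures are not reproduced in this text, so the exchange partners (PE q with PE q XOR 2, then PE q with PE q XOR 1), the operand order of the subtractions, the unit multipliers on PEs 0 to 2, and the resulting output placement are assumptions chosen to be consistent with the description of PE3. The function name is hypothetical.

```python
def four_point_fft_on_hypercube(x, w):
    """One length-4 distributed FFT; list index q stands for PE q."""
    # Figs. 9A/9B: load X and W, then multiply (MPYCX)
    z = [x[q] * w[q] for q in range(4)]
    # Fig. 9C: exchange along one hypercube dimension (assumed partner: q XOR 2)
    recv = [z[q ^ 2] for q in range(4)]
    # Fig. 9D: add on PEs 0 and 1, subtract on PEs 2 and 3
    a = [z[q] + recv[q] if q < 2 else recv[q] - z[q] for q in range(4)]
    # Fig. 9E: PE3 multiplies by -i; the other PEs are assumed to multiply by 1
    factor = [1, 1, 1, -1j]
    a = [a[q] * factor[q] for q in range(4)]
    # Fig. 9F: exchange along the other dimension (assumed partner: q XOR 1)
    recv = [a[q ^ 1] for q in range(4)]
    # Fig. 9G: add on PEs 0 and 2, subtract on PEs 1 and 3
    b = [a[q] + recv[q] if q % 2 == 0 else recv[q] - a[q] for q in range(4)]
    # Fig. 9H: store; under these assumptions PE0..PE3 hold Y0, Y2, Y1, Y3
    return b

z = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.25 - 4j]
out = four_point_fft_on_hypercube(z, [1, 1, 1, 1])
ref = [sum(z[q] * (-1j) ** (q * k) for q in range(4)) for k in range(4)]
assert all(abs(o - r) < 1e-9 for o, r in zip(out, [ref[0], ref[2], ref[1], ref[3]]))
```

Each exchange pairs PEs whose indices differ in a single bit, so both communication steps use hypercube links only.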

The code for PEs 0, 1, and 2 is very similar; the two subtractions in the arithmetic 
logic unit in steps 9D and 9G are substituted by additions or subtractions in the other PEs as 
required by the algorithm displayed in the signal flow graphs. To achieve that capability and 
the distinct MPYCX operation in Fig. 9E shown in these figures, synchronous MIMD 
capability is required as described in greater detail in United States Patent Application Serial 
No. 09/187,539 filed November 6, 1998 and incorporated by reference herein in its entirety. 
By appropriate packing, a very tight software pipeline can be achieved as shown in Fig. 9I for 
this FFT example using only two VLIWs. 

In the steady state, as can be seen in Fig. 9I, the Load, ALU, MAU, and DSU units are 
fully utilized in the two VLIWs while the store unit is used half of the time. This high 
utilization rate using two VLIWs leads to very high performance. For example, a 256-point 
complex FFT can be accomplished in 425 cycles on a 2x2 ManArray. 

As can be seen in the above example, this implementation accomplishes the 
following: 

• An FFT butterfly of length 4 can be calculated and stored every two cycles, using four 
PEs. 

• The communication requirement of the FFT is completely overlapped by the 
computational requirements of this algorithm. 

• The communication is along the hypercube connections that are available as a subset 
of the connections available in the ManArray interconnection network. 

• The steady state of this algorithm consists of only two VLIW lines (the source code is 
two VLIW lines long). 

• All execution units except the Store unit are utilized all the time, which leads us to 
conclude that this implementation is optimal for this architecture. 
Problem Size Discussion 
The equation: 

$$F_n = (F_p \otimes I_m)\, D_{p,m}\, (I_p \otimes F_m)\, P_{n,p}$$

where: 

n is the length of the transform, 
p is the number of PEs, and 
m = n/p 

is parameterized by the length of the transform n and the number of PEs, where m=n/p relates 
to the size of local memory needed by the PEs. For a given power-of-2 number of processing 
elements and a sufficient amount of available local PE memory, distributed FFTs of size p 
can be calculated on a ManArray processor since only hypercube connections are required. 
The hypercube of p or fewer nodes is a proper subset of the ManArray network. When p is a 
multiple of the number of processing elements, each PE emulates the operation of more than 
one virtual node. Therefore, any size of FFT problem can be handled using the above 
equation on any size of ManArray processor. 

For direct execution, in other words, no emulation of virtual PEs, on a ManArray of 
size p, we need to provide a distributed FFT algorithm of equal size. For p=1, it is the 
sequential FFT. For p=2, the FFT of length 2 is the butterfly: 

$$Y_0 = X_0 + w X_1, \quad \text{and}$$

$$Y_1 = X_0 - w X_1$$

where X0 and Y0 reside in or must be saved in the local memory of PE0 and X1 and Y1 on 
PE1, respectively. The VLIWs in PE0 and PE1 in a 1x2 ManArray processor (p=2) that are 
required for the calculation of multiple FFTs of length 2 are shown in Fig. 9J which shows 
that two FFT results are produced every two cycles using four VLIWs. 
Extending Complex Multiplication 

It is noted that in the two-cycle complex multiplication hardware described in Figs. 6 
and 7, the addition and subtraction blocks 623, 625, 723, and 725 operate in the second 
execution cycle. By including the MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 
instructions in the ManArray MAU, one of the execution units 131 of Fig. 1, the complex 
multiplication operations can be extended. The ManArray MAU also supports multiply 
accumulate operations (MACs) as shown in Figs. 11A and 12A for use in general digital 
signal processing (DSP) applications. A multiply accumulate instruction (MPYA) 1100 as 
shown in Fig. 11A, and a sum two product accumulate instruction (SUM2PA) 1200 as shown 
in Fig. 12A, are defined as follows. 

In the MPYA instruction 1100 of Fig. 11A, the product of source registers Rx and Ry 
is added to target register Rt. The word multiply form of this instruction multiplies two 32- 
bit values producing a 64-bit result which is added to a 64-bit odd/even target register. The 
dual halfword form of MPYA instruction 1100 multiplies two pairs of 16-bit values 
producing two 32-bit results: one is added to the odd 32-bit word, the other is added to the 
even 32-bit word of the odd/even target register pair. Syntax and operation details 1110 are 
shown in Fig. 11B. In the SUM2PA instruction 1200 of Fig. 12A, the product of the high 
halfwords of source registers Rx and Ry is added to the product of the low halfwords of Rx 
and Ry and the result is added to target register Rt and then stored in Rt. Syntax and 
operation details 1210 are shown in Fig. 12B. 
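The arithmetic of the dual halfword MPYA form and of SUM2PA can be written out as a short Python sketch. Since Figs. 11B and 12B are not reproduced here, several details are assumptions: operands are treated as signed, accumulators wrap at 32 bits, and in the dual halfword form the high-halfword product is assumed to go to the odd register and the low-halfword product to the even register. All function names are hypothetical.

```python
def s16(v):
    """Low 16 bits of v as a signed value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def wrap32(v):
    """Wrap v to a signed 32-bit value (saturation is not modeled)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v

def halves(word):
    """Return (high halfword, low halfword) of a 32-bit register."""
    return s16(word >> 16), s16(word)

def mpya_dual_halfword(rt_odd, rt_even, rx, ry):
    """Two 16x16 products, each added to one word of the odd/even pair."""
    xh, xl = halves(rx)
    yh, yl = halves(ry)
    return wrap32(rt_odd + xh * yh), wrap32(rt_even + xl * yl)

def sum2pa(rt, rx, ry):
    """Rt <- Rt + Rx.H1*Ry.H1 + Rx.H0*Ry.H0."""
    xh, xl = halves(rx)
    yh, yl = halves(ry)
    return wrap32(rt + xh * yh + xl * yl)

rx = (3 << 16) | 5                 # halfwords (3, 5)
ry = ((-2 & 0xFFFF) << 16) | 7     # halfwords (-2, 7)
print(sum2pa(10, rx, ry))          # 10 + 3*(-2) + 5*7
print(mpya_dual_halfword(100, 200, rx, ry))
```

SUM2PA already has the shape of one half of a complex product (two halfword products combined and accumulated), which is what makes the merge with the multiply complex instructions natural.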

Both MPYA and SUM2PA generate the accumulate result in the second cycle of the
two-cycle pipeline operation. By merging MPYCX, MPYCXD2, MPYCXJ, and
MPYCXJD2 instructions with MPYA and SUM2PA instructions, the hardware supports the
extension of the complex multiply operations with an accumulate operation. The
mathematical operation is defined as: Z_T = Z_R + X_R Y_R - X_I Y_I + i(Z_I + X_R Y_I + X_I Y_R),
where X = X_R + iX_I, Y = Y_R + iY_I, Z = Z_R + iZ_I is the value accumulated into, and i is
an imaginary number, or the square root of negative one, with i^2 = -1. This complex
multiply accumulate is calculated in a variety of contexts, and it has been recognized that it
will be highly advantageous to perform this calculation faster and more efficiently.
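
As a check on the formula above, the following minimal Python sketch evaluates the complex multiply accumulate using only real multiplications and additions and compares it with Python's built-in complex arithmetic. The function name complex_mac and the tuple representation of the operands are illustrative assumptions, not part of the instruction definitions.

```python
def complex_mac(z, x, y):
    """Return (real, imag) of Z + X*Y, with each operand given as a (real, imag) pair."""
    z_r, z_i = z
    x_r, x_i = x
    y_r, y_i = y
    real = z_r + x_r * y_r - x_i * y_i   # Z_R + X_R Y_R - X_I Y_I
    imag = z_i + x_r * y_i + x_i * y_r   # Z_I + X_R Y_I + X_I Y_R
    return real, imag


z, x, y = (1, 2), (3, 4), (5, -6)
result = complex_mac(z, x, y)
reference = complex(*z) + complex(*x) * complex(*y)
print(result, reference)
```

The four real products in the formula correspond to the four multiplications formed in the first cycle, and the two three-term sums correspond to the second-cycle add and subtract operations.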

For this purpose, an MPYCXA instruction 1300 (Fig. 13A), an MPYCXAD2
instruction 1400 (Fig. 14A), an MPYCXJA instruction 1500 (Fig. 15A), and an
MPYCXJAD2 instruction 1600 (Fig. 16A) define the special hardware instructions that
handle the multiplication with accumulate for complex numbers. The MPYCXA instruction
1300, for multiplication of complex numbers with accumulate, is shown in Fig. 13A. Utilizing
this instruction, the accumulated complex product of two source operands is rounded
according to the rounding mode specified in the instruction and loaded into the target register.
The complex numbers are organized in the source register such that halfword H1 contains the
real component and halfword H0 contains the imaginary component. The MPYCXA
instruction format is shown in Fig. 13A. The syntax and operation description 1310 is shown
in Fig. 13B.

The MPYCXAD2 instruction 1400, for multiplication of complex numbers with
accumulate, with the results divided by two, is shown in Fig. 14A. Utilizing this instruction,
the accumulated complex product of two source operands is divided by two, rounded
according to the rounding mode specified in the instruction, and loaded into the target
register. The complex numbers are organized in the source register such that halfword H1
contains the real component and halfword H0 contains the imaginary component. The
MPYCXAD2 instruction format is shown in Fig. 14A. The syntax and operation description
1410 is shown in Fig. 14B.

The MPYCXJA instruction 1500, for multiplication of complex numbers with
accumulate where the second argument is conjugated, is shown in Fig. 15A. Utilizing this
instruction, the accumulated complex product of the first source operand times the conjugate
of the second source operand is rounded according to the rounding mode specified in the
instruction and loaded into the target register. The complex numbers are organized in the
source register such that halfword H1 contains the real component and halfword H0 contains
the imaginary component. The MPYCXJA instruction format is shown in Fig. 15A. The
syntax and operation description 1510 is shown in Fig. 15B.

The MPYCXJAD2 instruction 1600, for multiplication of complex numbers with
accumulate where the second argument is conjugated, with the results divided by two, is
shown in Fig. 16A. Utilizing this instruction, the accumulated complex product of the first
source operand times the conjugate of the second operand is divided by two, rounded
according to the rounding mode specified in the instruction, and loaded into the target register.
The complex numbers are organized in the source register such that halfword H1 contains the
real component and halfword H0 contains the imaginary component. The MPYCXJAD2
instruction format is shown in Fig. 16A. The syntax and operation description 1610 is shown
in Fig. 16B.

All of the above instructions 1100, 1200, 1300, 1400, 1500 and 1600
complete in two cycles and are pipelineable. That is, another operation can start executing
on the execution unit after the first cycle. All complex multiplication instructions 1300,
1400, 1500 and 1600 return a word containing the real and imaginary part of the complex
product in half words H1 and H0 respectively.
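
The relationship among the four accumulate instructions, and the H1/H0 packing of a complex value in one word, can be summarized in a short Python sketch. It works on exact values before any rounding and ignores fixed-point scaling; the names pack, unpack, VARIANTS and complex_mac_variant are hypothetical and chosen only for this illustration.

```python
def pack(real, imag):
    """Pack a complex value into one 32-bit word: H1 holds real, H0 holds imaginary."""
    return ((real & 0xFFFF) << 16) | (imag & 0xFFFF)


def unpack(word):
    """Recover the signed 16-bit (real, imag) pair from a packed word."""
    def s16(v):
        v &= 0xFFFF
        return v - 0x10000 if v & 0x8000 else v
    return s16(word >> 16), s16(word)


# (conjugate second operand, divide result by two) for each instruction
VARIANTS = {
    "MPYCXA": (False, False),
    "MPYCXAD2": (False, True),
    "MPYCXJA": (True, False),
    "MPYCXJAD2": (True, True),
}


def complex_mac_variant(name, z, x, y):
    """Exact (unrounded) result of the named instruction on Python complex values."""
    conjugate, divide_by_two = VARIANTS[name]
    if conjugate:
        y = y.conjugate()
    result = z + x * y
    return result / 2 if divide_by_two else result


z, x, y = complex(1, 2), complex(3, 4), complex(5, -6)
for name in VARIANTS:
    print(name, complex_mac_variant(name, z, x, y))
print(unpack(pack(3, -4)))
```

Only two controls distinguish the four instructions in this view: whether the second operand is conjugated and whether the accumulated result is halved before rounding.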

To preserve maximum accuracy, and provide flexibility to programmers, the same 
four rounding modes specified previously for MPYCX, MPYCXD2, MPYCXJ, and 
MPYCXJD2 are used in the extended complex multiplication with accumulate. 
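
The four rounding modes are not restated in this passage; claim 6 lists them as rounding toward the nearest integer, toward zero, toward infinity, and toward negative infinity. The following Python sketch shows one way such modes could behave when low-order bits of a signed result are discarded. The function name round_shift and the mode labels are hypothetical, and the treatment of exact ties in the nearest mode (rounded toward positive infinity here) is an assumption, since the hardware's tie behavior is not given in this section.

```python
def round_shift(value, shift, mode):
    """Drop `shift` low-order bits of a signed integer using the named rounding mode."""
    floor = value >> shift                 # arithmetic shift rounds toward -infinity
    remainder = value - (floor << shift)   # always in the range 0 .. 2**shift - 1
    if remainder == 0 or mode == "toward_negative_infinity":
        return floor
    if mode == "toward_infinity":
        return floor + 1
    if mode == "toward_zero":
        return floor + 1 if value < 0 else floor
    if mode == "nearest":
        # Assumption: an exact tie rounds up (toward +infinity).
        return floor + 1 if remainder >= (1 << (shift - 1)) else floor
    raise ValueError("unknown rounding mode: " + mode)


for mode in ("nearest", "toward_zero", "toward_infinity", "toward_negative_infinity"):
    # -5 / 4 = -1.25 and 5 / 4 = 1.25
    print(mode, round_shift(-5, 2, mode), round_shift(5, 2, mode))
```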

Hardware 1700 and 1800 for implementing the multiply complex with accumulate
instructions is shown in Fig. 17 and Fig. 18, respectively. These figures illustrate the high
level view of the hardware 1700 and 1800 appropriate for these instructions. The important
changes to note between Fig. 17 and Fig. 6 and between Fig. 18 and Fig. 7 are in the second
stage of the pipeline where the two-input adder blocks 623, 625, 723, and 725 are replaced
with three-input adder blocks 1723, 1725, 1823, and 1825. Further, two new half word
source operands are used as inputs to the operation. The Rt.H1 1731 (1831) and Rt.H0 1733
(1833) values are properly aligned and selected by multiplexers 1735 (1835) and 1737 (1837)
as inputs to the new adders 1723 (1823) and 1725 (1825). For the appropriate alignment,
Rt.H1 is shifted right by 1 bit and Rt.H0 is shifted left by 15 bits. The add/subtract (add/sub)
blocks 1723 (1823) and 1725 (1825) operate on the input data and generate the outputs as
shown. The add function and subtract function are selectively controlled functions allowing
either addition or subtraction operations as specified by the instruction. The results are
rounded and bits 30-15 of both 32-bit results are selected 1727 (1827) and stored in the
appropriate half word of the target register 1729 (1829) in the CRF. It is noted that the
multiplexers 1735 (1835) and 1737 (1837) select the zero input, indicated by the ground
symbol, for the non-accumulate versions of the complex multiplication series of
instructions.
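
The two-stage data path described above can be modeled bit for bit in a short Python sketch. It is a behavioral illustration only: the function and helper names are hypothetical, the rounding step is assumed to be round-to-nearest implemented by adding half of the discarded range, overflow is assumed to wrap, and the divide-by-two variants are omitted. The alignment of the accumulate operands follows the text: both Rt.H1 (shifted right by 1 from the upper halfword) and Rt.H0 (shifted left by 15) end up occupying bits 30-15.

```python
def s16(value):
    """Interpret the low 16 bits as a signed two's-complement halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def unpack(word):
    """Return the signed (H1, H0) = (real, imag) halfwords of a 32-bit word."""
    return s16(word >> 16), s16(word)


def pack(real, imag):
    """Place real in H1 and imag in H0 of a 32-bit word."""
    return ((real & 0xFFFF) << 16) | (imag & 0xFFFF)


def complex_multiply_datapath(rt, rx, ry, accumulate=True, conjugate=False):
    x_r, x_i = unpack(rx)
    y_r, y_i = unpack(ry)
    z_r, z_i = unpack(rt)

    # First cycle: four 16 x 16 multipliers feed four pipeline registers.
    rr, ii, ri, ir = x_r * y_r, x_i * y_i, x_r * y_i, x_i * y_r

    # Multiplexers: aligned Rt halfwords, or zero for the non-accumulate forms.
    acc_r = (z_r << 15) if accumulate else 0
    acc_i = (z_i << 15) if accumulate else 0

    # Second cycle: three-input add/sub blocks.
    if conjugate:                 # X * conj(Y)
        real = rr + ii + acc_r
        imag = ir - ri + acc_i
    else:                         # X * Y
        real = rr - ii + acc_r
        imag = ri + ir + acc_i

    # Round (assumed round-to-nearest) and select bits 30-15 of each result.
    half = 1 << 14
    return pack(s16((real + half) >> 15), s16((imag + half) >> 15))


# Operands as 16-bit fractions scaled by 2**15: x = 0.5 + 0.25i, y = 0.5, z = 0.125 + 0.125i
x, y, z = pack(16384, 8192), pack(16384, 0), pack(4096, 4096)
print(unpack(complex_multiply_datapath(z, x, y)))                    # accumulate form
print(unpack(complex_multiply_datapath(z, x, y, accumulate=False)))  # plain multiply
```

Passing accumulate=False reproduces the zero selection made by the multiplexers for the non-accumulate instructions, so one model covers both the Fig. 6/7 and Fig. 17/18 behavior under the stated assumptions.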

While the present invention has been disclosed in the context of various aspects of 
presently preferred embodiments, it will be recognized that the invention may be suitably 
applied to other environments consistent with the claims which follow.

We claim:

1. An apparatus for the efficient processing of complex multiplication
computations, the apparatus comprising:

at least one controller sequence processor (SP);
a memory for storing process control instructions;

a first multiply complex numbers instruction stored in the memory and operative to
control the PEs to carry out a multiplication operation involving a pair of complex numbers;
and

hardware for implementing the first multiply complex numbers instruction.

2. The apparatus of claim 1 further comprising a plurality of processing elements
(PEs) interconnected with said SP and arranged in an N x N array interconnected in a
manifold array interconnection network.

3. The apparatus of claim 1 wherein the first multiply complex instruction
completes execution in 2 cycles.

4. The apparatus of claim 1 wherein the first multiply complex instruction is
tightly pipelineable.

5. The apparatus of claim 1 wherein each complex number is stored as a word, 
each word comprising a first half word and a second half word, with a real component of 
each complex number being stored as the first half word and an imaginary component of each
complex number being stored as the second half word.

6. The apparatus of claim 1 wherein the first multiply complex instruction 
includes a plurality of rounding modes, the rounding modes including: 

rounding toward a nearest integer;
rounding toward zero;
rounding toward infinity; and
rounding toward negative infinity.

7. The apparatus of claim 1 wherein the first multiply complex numbers
instruction is one of the following group of instructions: a multiply complex numbers
(MPYCX), a multiply complex numbers instruction (MPYCXJ) operative to carry out the
multiplication of a pair of complex numbers where an argument is conjugated, a multiply
complex numbers instruction (MPYCXD2) operative to carry out the multiplication of a pair
of complex numbers with a result divided by two, and a multiply complex numbers instruction
(MPYCXJD2) operative to carry out the multiplication of a pair of complex numbers where
an argument is conjugated with a result divided by two.

8. The apparatus of claim 1 further comprising a multiply accumulate unit
including the memory for storing the first multiply complex numbers instruction.

9. The apparatus of claim 8 wherein the multiply accumulate unit operates in
response to a multiply accumulate instruction (MPYA) to extend a multiplication operation
with an accumulate operation.

10. The apparatus of claim 8 wherein the multiply accumulate unit operates in
response to a sum two product accumulate instruction (SUM2PA) to extend two
multiplication operations with an accumulate operation.

11. The apparatus of claim 9 wherein the multiply accumulate unit operates in
response to a multiply complex with accumulate instruction (MPYCXA) to carry out the 
multiplication of a pair of complex numbers with accumulation of a third complex number. 

12. The apparatus of claim 11 wherein the MPYCXA instruction completes
execution in 2 cycles.

13. The apparatus of claim 12 wherein the MPYCXA instruction is tightly
pipelineable. 

14. The apparatus of claim 1 further comprising one or more of the following 
additional instructions (MPYCXA, MPYCXAD2, MPYCXJA or MPYCXJAD2) stored in
the memory to carry out complex multiplication operations pipelined in 2 cycles.

15. A method for the computation of an FFT by a plurality of processing elements 
(PEs), the method comprising the steps of: 

loading input data from a memory into each PE in a cyclic manner;
calculating a local FFT by each PE;
multiplying by the twiddle factors and calculating a FFT by the cluster of PEs; and
loading the FFTs into the memory.

16. A method for the computation of a distributed FFT by an N x N processing 
element (PE) array, the method comprising the steps of: 

loading a complex number x and a corresponding twiddle factor w from a memory
into each of the PEs;

calculating a first product by the multiplication of the complex numbers x and w;
transmitting the first product from each of the PEs to another PE in the N x N array;
receiving the first product and treating it as a second product in each of the PEs;
selectively adding or subtracting the first product and the second product to form a
first result;

calculating a third product in selected PEs;

transmitting the first result or third product in selected PEs to another PE in the N x N
array;

selectively adding or subtracting the received values to form a second result; and
storing the second results in the memory.

17. A method for efficient computation by a 2 x 2 processing element (PE) array
interconnected in a manifold array interconnection network, the array comprising four PEs
(PE0, PE1, PE2 and PE3), the method comprising the steps of:

loading a complex number x and a corresponding twiddle factor w from a memory
into each of the four PEs, complex number x including subparts x0, x1, x2 and x3, twiddle
factor w including subparts w0, w1, w2 and w3;

multiplying the complex numbers x and w, such that
PE0 multiplies x0 and w0 to produce a product0,
PE1 multiplies x1 and w1 to produce a product1,
PE2 multiplies x2 and w2 to produce a product2, and
PE3 multiplies x3 and w3 to produce a product3;

transmitting the product0, the product1, the product2 and the product3, such that
PE0 transmits the product0 to PE2,
PE1 transmits the product1 to PE3,
PE2 transmits the product2 to PE0, and
PE3 transmits the product3 to PE1; and

performing arithmetic logic operations, such that
PE0 adds the product0 and the product2 to produce a sum t0,
PE1 adds the product1 and the product3 to produce a sum t2,
PE2 subtracts the product2 from the product0 to produce a sum t1, and
PE3 subtracts the product3 from the product1 to produce a result which is
multiplied by -i to produce a sum t3.



18. The method of claim 17 further comprising the steps of:

transmitting the sums t0, t1, t2 and t3, such that
PE0 transmits t0 to PE1,
PE1 transmits t2 to PE0,
PE2 transmits t1 to PE3, and
PE3 transmits t3 to PE2;

performing the arithmetic logic operations, such that
PE0 adds t0 and t2 to produce a y0,
PE1 subtracts t2 from t0 to produce a y1,
PE2 adds t1 and t3 to produce a y2, and
PE3 subtracts t3 from t1 to produce a y3; and

storing y0, y1, y2 and y3 in a memory.

19. A special hardware instruction for handling the multiplication with accumulate
for two complex numbers from a source register whereby utilizing said instruction an
accumulated complex product of two source operands is rounded according to a rounding
mode specified in the instruction and loaded into a target register with the complex numbers
organized in the source such that a halfword (H1) contains the real component and a halfword
(H0) contains the imaginary component.

20. The special hardware instruction of claim 19 wherein the accumulated
complex product is divided by two before it is rounded.

21. An apparatus to efficiently fetch instructions including complex multiplication
instructions and an accumulate form of multiplication instructions from a memory element
and dispatch the fetched instruction to at least one of a plurality of multiply complex and
multiply with accumulate execution units to carry out the instruction specified operation, the
apparatus comprising:

a memory element;
means for fetching said instructions from the memory element;
a plurality of multiply complex and multiply with accumulate execution units; and
means to dispatch the fetched instruction to at least one of said plurality of execution
units to carry out the instruction specified operation.

22. The apparatus of claim 21 further comprising:

an instruction register to hold a dispatched multiply complex instruction (MPYCX);
means to decode the MPYCX instruction and control the execution of the MPYCX
instruction;

two source registers each holding a complex number as operand inputs to the multiply
complex execution hardware;

four multiplication units to generate terms of the complex multiplication;
four pipeline registers to hold the multiplication results;

an add function which adds two of the multiplication results from the pipeline
registers for the imaginary component of the result;

a subtract function which subtracts two of the multiplication results from the pipeline
registers for the real component of the result;

a round and select unit to format the real and imaginary results; and

a result storage location for saving the final multiply complex result, whereby the
apparatus is operative for the efficient processing of multiply complex computations.

23. The apparatus of claim 21 wherein the means for fetching said instructions is a
sequence processor (SP) controller.

24. The apparatus of claim 22 wherein the round and select unit provides a shift 
right as a divide by 2 operation for a multiply complex divide by 2 instruction (MPYCXD2). 

25. The apparatus of claim 21 further comprising: 

an instruction register to hold a dispatched multiply complex instruction (MPYCXJ);
means to decode the MPYCXJ instruction and control the execution of the MPYCXJ
instruction;

two source registers each holding a complex number as operand inputs to the multiply
complex execution hardware;

four multiplication units to generate terms of the complex multiplication;
four pipeline registers to hold the multiplication results;

an add function which adds two of the multiplication results from the pipeline
registers for the real component of the result;

a subtract function which subtracts two of the multiplication results from the pipeline
registers for the imaginary component of the result;

a round and select unit to format the real and imaginary results; and

a result storage location for saving the final multiply complex conjugate result,
whereby the apparatus is operative for the efficient processing of multiply complex conjugate
computations.

26. The apparatus of claim 25 wherein the round and select unit provides a shift
right as a divide by 2 operation for a multiply complex conjugate divide by 2 instruction
(MPYCXJD2).

27. The apparatus of claim 21 further comprising:

an instruction register to hold the dispatched multiply accumulate instruction
(MPYA);

means to decode the MPYA instruction and control the execution of the MPYA
instruction;

two source registers each holding a source operand as inputs to the multiply
accumulate execution hardware;

at least two multiplication units to generate two products of the multiplication;
at least two pipeline registers to hold the multiplication results;
at least two accumulate operand inputs to the second pipeline stage accumulate
hardware;

at least two add functions which each adds the results from the pipeline registers with
the third accumulate operand creating two multiply accumulate results;

a round and select unit to format the results if required by the MPYA instruction; and
a result storage location for saving the final multiply accumulate result, whereby the
apparatus is operative for the efficient processing of multiply accumulate computations.

28. The apparatus of claim 21 further comprising: 

an instruction register to hold a dispatched multiply accumulate instruction 
(SUM2PA); 

means to decode the SUM2PA instruction and control the execution of the SUM2PA
instruction;

at least two source registers each holding a source operand as inputs to the SUM2PA
execution hardware;

at least two multiplication units to generate two products of the multiplication;
at least two pipeline registers to hold the multiplication results;
at least one accumulate operand input to the second pipeline stage accumulate
hardware;

at least one add function which adds the results from the pipeline registers with the
third accumulate operand creating a SUM2PA result;

a round and select unit to format the results if required by the SUM2PA instruction;
and

a result storage location for saving the final result, whereby the apparatus is operative
for the efficient processing of sum of 2 products accumulate computations.

29. The apparatus of claim 21 further comprising:

an instruction register to hold the dispatched multiply complex accumulate instruction
(MPYCXA);

means to decode the MPYCXA instruction and control the execution of the
MPYCXA instruction;

two source registers each holding a complex number as operand inputs to the multiply
complex accumulate execution hardware;

four multiplication units to generate terms of the complex multiplication;
four pipeline registers to hold the multiplication results;
at least two accumulate operand inputs to the second pipeline stage accumulate
hardware;

an add function which adds two of the multiplication results from the pipeline
registers and also adds one of the accumulate operand inputs for the imaginary component of
the result;

a subtract function which subtracts two of the multiplication results from the pipeline
registers and also adds the other accumulate operand input for the real component of the
result;

a round and select unit to format the real and imaginary results; and
a result storage location for saving the final multiply complex accumulate result,
whereby the apparatus is operative for the efficient processing of multiply complex
accumulate computations.

30. The apparatus of claim 29 wherein the round and select unit provides a shift 
right as a divide by 2 operation for a multiply complex accumulate divide by 2 instruction 
(MPYCXAD2). 

31. The apparatus of claim 21 further comprising:

an instruction register to hold the dispatched multiply complex conjugate accumulate
instruction (MPYCXJA);

means to decode the MPYCXJA instruction and control the execution of the
MPYCXJA instruction;

two source registers each holding a complex number as operand inputs to the multiply
complex accumulate execution hardware;

four multiplication units to generate terms of the complex multiplication;
four pipeline registers to hold the multiplication results;
at least two accumulate operand inputs to the second pipeline stage accumulate
hardware;

an add function which adds two of the multiplication results from the pipeline
registers and also adds one of the accumulate operand inputs for the real component of the
result;

a subtract function which subtracts two of the multiplication results from the pipeline
registers and also adds the other accumulate operand input for the imaginary component of
the result;

a round and select unit to format the real and imaginary results; and
a result storage location for saving the final multiply complex conjugate accumulate
result, whereby the apparatus is operative for the efficient processing of multiply complex
conjugate accumulate computations.

32. The apparatus of claim 31 wherein the round and select unit provides a shift
right as a divide by 2 operation for a multiply complex conjugate accumulate divide by 2
instruction (MPYCXJAD2).

33. The apparatus of claim 21 wherein the complex multiplication instructions and 
accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, 
MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of
these instructions complete execution in 2 cycles.

34. The apparatus of claim 21 wherein the complex multiplication instructions and
accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ,
MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of
these instructions are tightly pipelineable.

35. An apparatus for the efficient processing of an FFT, the apparatus comprising: 
at least one controller sequence processor (SP);

a plurality of processing elements (PEs) arranged in an NxN array interconnected in a 
manifold (ManArray) interconnection network; and 

a memory for storing instructions to be processed by the SP and by the array of PEs. 

36. The apparatus of claim 22 wherein the add function and subtract function are
selectively controlled functions allowing either addition or subtraction operations as specified
by the instruction.

37. The apparatus of claim 25 wherein the add function and subtract function are
selectively controlled functions allowing either addition or subtraction operations as specified
by the instruction.

38. The apparatus of claim 29 wherein the add function and subtract function are
selectively controlled functions allowing either addition or subtraction operations as specified
by the instruction.

39. The apparatus of claim 31 wherein the add function and subtract function are
selectively controlled functions allowing either addition or subtraction operations as specified 
by the instruction. 
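
By way of illustration only, the 2 x 2 data flow recited in claims 17 and 18 can be traced in a short Python sketch. The function names fft4_on_2x2 and dft are hypothetical, each PE is modeled as a list index, and the exchanges between PEs are modeled as plain variable reads. The sketch suggests that the four stored values y0 to y3 are the 4-point discrete Fourier transform of the products x_k * w_k, delivered in the order (0, 2, 1, 3); that ordering is an observation from this model and is not stated in the claims.

```python
import cmath


def fft4_on_2x2(x, w):
    """Trace the two exchange/compute steps for PE0..PE3; returns [y0, y1, y2, y3]."""
    # Each PE multiplies its own sample by its own twiddle factor.
    p = [x[k] * w[k] for k in range(4)]

    # First exchange (PE0 <-> PE2, PE1 <-> PE3) followed by add/subtract.
    t0 = p[0] + p[2]            # PE0
    t2 = p[1] + p[3]            # PE1
    t1 = p[0] - p[2]            # PE2
    t3 = -1j * (p[1] - p[3])    # PE3, difference multiplied by -i

    # Second exchange (PE0 <-> PE1, PE2 <-> PE3) followed by add/subtract.
    y0 = t0 + t2                # PE0
    y1 = t0 - t2                # PE1
    y2 = t1 + t3                # PE2
    y3 = t1 - t3                # PE3
    return [y0, y1, y2, y3]


def dft(v):
    """Direct discrete Fourier transform, used only as a reference."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


x = [1, 2, 3, 4]
w = [1, 1, 1, 1]   # unit twiddle factors, so the products equal the inputs
y = fft4_on_2x2(x, w)
reference = dft(x)
print(y)
print([reference[k] for k in (0, 2, 1, 3)])
```

With non-unit twiddle factors the same steps transform the products rather than the raw inputs, which is consistent with the use of this kernel inside a larger distributed FFT as in claims 15 and 16.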